MULTIMEDIA INTEGRATION DESCRIPTION SCHEME, METHOD AND SYSTEM 

FOR MPEG-7 



This non-provisional application claims the benefit of U.S. Provisional 
Application No. 60/118,022, filed February 1, 1999. 

This application includes an Appendix containing computer code that performs 
content description in accordance with the exemplary embodiment of the present 
invention. That Appendix of the disclosure of this patent document contains material 
which is subject to copyright protection. The copyright owner has no objection to the 
facsimile reproduction by anyone of the patent document or the patent disclosure, as it 
appears in the Patent and Trademark Office patent file or records, but otherwise reserves 
all copyright rights whatsoever. 

BACKGROUND OF THE INVENTION 

1. Field of Invention 

The present invention generally relates to audiovisual data representation. More 
particularly, this invention relates to integrating the descriptions of multiple categories of 
audiovisual content to allow such content to be searched or browsed with ease in digital 
libraries, Internet web sites and broadcast media, for example. 

2. Description of Related Art 

More and more audiovisual information is becoming available from many sources 
around the world. Such information may be represented by various forms of media, such 
as still pictures, video, graphics, 3D models, audio and speech. In general, audiovisual 
information plays an important role in our society, whether it is recorded in such media as 
film or magnetic tape or originates, in real time, from audio or visual sensors, and whether 
it is analogue or, increasingly, digital. 

While audio and visual information used to be consumed directly by human 
beings, computational systems are increasingly creating, exchanging, retrieving and 
reprocessing this audiovisual information. Such is the case for image understanding 
(e.g., surveillance, intelligent vision, smart cameras), media conversion (e.g., speech to 
text, picture to speech, speech to picture), information retrieval (e.g., quickly and 
efficiently searching for various types of multimedia documents of interest to the user), 
and filtering to receive only those multimedia data items in a stream of audiovisual 
content which satisfy the user's preferences. 

For example, a code in a television program triggers a suitably programmed VCR 
to record that program, or an image sensor triggers an alarm when a certain visual event 
happens. Automatic transcoding may be performed based on a string of characters or 
audible information or a search may be performed in a stream of audio or video data. In 
all these examples, the audiovisual information has been suitably "encoded" to enable a 
device or a computer code to take some action. 

In the infancy of web-based information communication and access systems, 
information is routinely transferred, searched, retrieved and processed. Presently, much 
of the information is predominantly represented in text form. This text-based information 
is accessed using text-based search algorithms. 

However, as web-based systems and multimedia technology continue to improve, 
more and more information is becoming available in a form other than text, for instance 
as images, graphics, speech, animation, video, audio and movies. As the volume of such 
information is increasing at a rapid rate, it is becoming important to be able to search for 
and retrieve a specific piece of information of interest easily. It is often difficult to find 
such information by text-only search. Thus, the increased presence of multimedia 
information and the need to find the required portions of it in an easy and 
reliable manner, irrespective of the search engines employed, have spurred the drive for 
a standard for accessing such information. 

The Moving Pictures Expert Group (MPEG) is a working group under the 
International Standards Organization/International Electrotechnical Commission in 
charge of the development of international standards for compression, decompression, 
processing and coded representation of video data, audio data and their combination. 
MPEG previously developed the MPEG-1, MPEG-2 and MPEG-4 standards, and is 
presently developing the MPEG-7 standard, formally called "Multimedia Content 
Description Interface", hereby incorporated by reference in its entirety. 
MPEG-7 will be a content representation standard for multimedia information 
search and will include techniques for describing individual media content and their 
combination. Thus, the MPEG-7 standard aims to provide a set of standardized tools 
to describe multimedia content. Therefore, the MPEG-7 standard, unlike the MPEG-1, 
MPEG-2 or MPEG-4 standards, is not a media content coding or compression standard 
but rather a standard for representation of descriptions of media content. The data 
representing descriptions is called "meta data". Thus, irrespective of how the media 
content is represented, e.g., analogue, PCM, MPEG-1, MPEG-2, MPEG-4, Quicktime, 
Windows Media, etc., the meta data associated with this content may, in the future, be 
MPEG-7. 

Often, the value of multimedia information depends on how easily it can be 
found, retrieved, accessed, filtered and managed. In spite of the fact that users have 
increasing access to this audiovisual information, searching, identifying and managing it 
efficiently is becoming more difficult because of the sheer volume of the information. 
Moreover, the question of identifying and managing multimedia content is not just 
restricted to database retrieval applications such as digital libraries, but extends to areas 
such as broadcast channel selection, multimedia editing and multimedia directory 
services. 

Although techniques for tagging audiovisual information allow some limited 
access and processing based on text-based search engines, the amount of information that 
may be included in such tags is somewhat limited. For example, for movie videos, the 
tag may reflect the name of the movie or a list of actors, etc., but this information may 
apply to the entire movie and may not be sub-divided to indicate the content of individual 
shots and objects in such shots. Moreover, the amount of information that may be 
included in such tags and the architecture for searching and processing that information 
are severely limited. 

SUMMARY OF THE INVENTION 
The invention provides a system and method for integrating multimedia 
descriptions in a way that allows humans, software components or devices to easily 
identify, manage, manipulate, and categorize the multimedia content. In this manner, a 
user who may be interested in locating a specific piece of multimedia content from a 
database, the Internet, or broadcast media, for example, may search for and find the 
multimedia content. 

In this regard, the invention provides a system and method that receives 
multimedia content from a multimedia stream and separates the multimedia content into 
separate components which are assigned to single media categories, such as image, video, 
audio, synthetic audiovisual, and text. Within each of the single media categories, media 
events are classified and descriptions of such single media events are generated. These 
descriptions are then integrated and formatted according to a multimedia integration 
description scheme. A multimedia integration description is then generated for the 
multimedia content. The multimedia integration description is then stored in a database. 
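
The flow summarized above (separate, describe, integrate, store) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the code of the Appendix: the function names, the dictionary layout and the in-memory `database` are all hypothetical.

```python
from collections import defaultdict

# Single-media categories named in the text; the string labels are assumptions.
CATEGORIES = ("image", "video", "audio", "synthetic", "text")


def separate(stream):
    """Group the components of a multimedia stream by single-media category."""
    by_category = defaultdict(list)
    for component in stream:
        if component["category"] not in CATEGORIES:
            raise ValueError(f"unknown category: {component['category']}")
        by_category[component["category"]].append(component)
    return by_category


def describe(component):
    """Produce a toy single-media description (illustrative fields only)."""
    return {"category": component["category"], "event": component["event"]}


def integrate(stream_id, stream):
    """Combine the single-media descriptions into one integrated record."""
    grouped = separate(stream)
    return {
        "stream": stream_id,
        "descriptions": {
            category: [describe(c) for c in components]
            for category, components in grouped.items()
        },
    }


database = {}  # hypothetical stand-in for the description database

stream = [
    {"category": "video", "event": "anchor on screen"},
    {"category": "audio", "event": "anchor speech"},
    {"category": "text", "event": "closed caption"},
]
database["news1"] = integrate("news1", stream)
```

Keeping the single-media descriptions grouped by category inside one record mirrors the idea that the integrated description is built from, rather than replaces, the per-media descriptions.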

As a result, a user may query a search engine which then retrieves the multimedia 
integration description from the database. The search engine can then provide the user a 
useful search result whose multimedia integration description meets the query 
requirements. 

The exemplary embodiment of the invention addresses the draft requirements of 
MPEG-7 promulgated by MPEG at the time of the filing of this patent application. That 
is, the invention provides object-oriented, generic abstraction and uses objects and events 
as fundamental entities for description. Thus, the invention provides an efficient 
framework for description of various types of multimedia data. 

The invention is also a comprehensive tool for describing multimedia data 
because it uses Extensible Markup Language (XML), which is self-describing. The 
present invention also provides flexibility because parts of the description scheme can be 
instantiated so as to provide efficient organization. The invention also provides 
extensibility and the ability to define relationships between data because elements defined 
in the description scheme can be used to derive new elements. 

These and other features and advantages of this invention are described in or are 
apparent from the following detailed description of the system and method according to 
this invention. 
BRIEF DESCRIPTION OF THE DRAWINGS 
The preferred embodiments of the invention will be described in detail with reference to 
the following figures wherein: 

Fig. 1A is an exemplary block diagram showing a multimedia integration system; 

Fig. 1B is an exemplary block diagram of an exemplary individual media type 
descriptor generation unit shown in Fig. 1A; 

Fig. 2 is an exemplary block diagram of the multimedia integration description scheme 
unit in Fig. 1A; 

Fig. 3 is an example of a multimedia stream, consisting of multimedia objects and 
single-media objects, and the relationship among these objects; 

Fig. 4 is a UML representation of the multimedia description scheme at the multimedia 
stream level which consists of one or more multimedia objects; 

Fig. 5 is a UML representation of the multimedia description scheme at the 
multimedia object level; and 

Fig. 6 is an exemplary flowchart showing the process of generating integrated 
multimedia content description for multimedia content. 

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS 
Prior to explaining the exemplary embodiment of the invention, a synopsis of 
MPEG-7 is provided to aid in the reader's understanding of how the exemplary 
embodiment processes multimedia data within the construct of MPEG-7. 

MPEG-7 is the result of a global demand that has logically followed the 
increasing availability of digital audiovisual content. Audiovisual information, both 
natural and synthetic, will continue to be increasingly available from many sources 
around the world. Also, users want to use this audiovisual information for various 
purposes. However, before the information can be used, it must be identified, located, 
indexed, and even characterized properly. At the same time, the increasing availability of 
potentially interesting material makes searching more difficult because of the increasingly 
voluminous pool of information to be searched. 

MPEG-7 is directed at standardizing the interface for describing multimedia 
content to allow efficient searching and retrieval for various types of multimedia material 
interesting to the user. MPEG-7 is meant to provide standardization of multimedia 
content descriptions. MPEG-7 expects to extend the limited capabilities of proprietary 
solutions in identifying content that exist today, notably by including more data types. In 
other words, MPEG-7 will specify a standard set of descriptors that can be used to 
describe various types of multimedia information. MPEG-7 will also specify predefined 
structures of descriptors and their relationships, as well as ways to define one's own 
structures. These structures are called description schemes (DSs). Defining new 
description schemes can be performed using a special language, the description definition 
language (or DDL), which is also a part of the MPEG-7 standard. The description, i.e., a 
set of instantiated description schemes, is associated with the content itself to allow fast 
and efficient searching for material of a user's interest. MPEG-7 will also include coded 
representations of a description for efficient storage or fast access. 

Conventionally, each search engine has its own syntax format. These differences 
in syntax format cause compatibility issues between search criteria, e.g., identical criteria 
used by different engines produce different results. With the use of description schemes 
under MPEG-7, these search engines will be able to process MPEG-7 multimedia 
contents regardless of the differing syntax formats to produce the same results. 

The requirements of MPEG-7 apply, in principle, to both real time and non-real 
time applications. Also, MPEG-7 will apply to push and pull applications. However, 
MPEG-7 will not standardize or evaluate applications. Rather, MPEG-7 will interact 
with many different applications in many different environments, which means that it will 
need to provide a flexible and extensible framework for describing multimedia data. 

Therefore, MPEG-7 will not define a monolithic system for content description. 
Rather, MPEG-7 will define a set of methods and tools for describing multimedia data. 
Thus, MPEG-7 expects to standardize a set of descriptors, a set of description schemes, a 
language to specify description schemes (and possibly descriptors), e.g., the description 
definition language, and one or more ways to encode descriptions. A starting point for the 
description definition language is XML, although it is expected that basic XML 
will eventually need to be customized and modified for use in MPEG-7. 



The exemplary embodiment of the invention described herein with reference to 
Figs. 1A-6 conforms to the requirements of the MPEG-7 standard, in its present form. 

The following description of the particular embodiment of the invention uses 
terminology that is consistent with definitions provided in the MPEG-7 standard. The 
term "data" indicates audiovisual information that is described using MPEG-7, regardless 
of storage, coding, display, transmission, medium or technology. Data encompasses, for 
example, graphics, still images, video, film, music, speech, sounds, text and any other 
relevant audiovisual medium. Examples of such data may be found in, for example, an 
MPEG-4 stream, a video tape, a compact disc containing music, sound or speech, a 
picture printed on paper or an interactive multimedia installation on the web. 

A "feature" indicates a distinctive characteristic of the data which signifies 
something to someone. Examples of features include image color, speech pitch, audio 
segment rhythm, video camera motion, video style, movie title, actors' names in a movie, 
etc. Examples of features of visual objects include shape, surface, complexity, motion, 
light, color, texture, shininess and transparency. 

A "descriptor" is a representation of a feature. It is possible to have several 
descriptors representing a single feature. A descriptor defines the syntax and the 
semantics of the feature representation and allows the evaluation of the corresponding 
feature via the descriptor value. Examples of such descriptors include color histogram, 
frequency component average, motion field, title text, etc. 

A "descriptor value" is an instantiation of a descriptor for a given data set. 
Descriptor values are combined using a description scheme to form a description. 

A "description scheme" specifies the structure and semantics of relationships 
between its components, which may be both descriptors and description schemes. The 
distinction between a description scheme and a descriptor is that a descriptor contains 
only basic data types, as provided by the description definition language. A descriptor 
also does not refer to another descriptor or description scheme. 

A "description" is the result of instantiating a description scheme. To instantiate a 
description scheme, a set of descriptor values that describe the data is structured 
according to a description scheme. Depending on the completeness of the set of 
descriptor values, the description scheme may be fully or partially instantiated. 
Additionally, it is possible that the description scheme may be merely incorporated by 
reference in the description rather than being actually present in the description. A 
"coded description" is a description that has been encoded to fulfill relevant requirements 
such as compression efficiency, error resilience, random access, etc. 
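
The relationship among descriptors, description schemes and descriptions defined above can be made concrete with a small Python sketch. It is a hedged illustration only: the class names, the `instantiate` helper and the example descriptors are assumptions chosen for this sketch and are not part of MPEG-7 or of the Appendix.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Descriptor:
    """Representation of a feature; it carries only a basic data type."""
    name: str
    feature: str
    value_type: type


@dataclass
class DescriptionScheme:
    """Structure over descriptors and nested description schemes."""
    name: str
    descriptors: list = field(default_factory=list)
    schemes: list = field(default_factory=list)

    def all_descriptors(self):
        """Collect descriptors from this scheme and every nested scheme."""
        found = list(self.descriptors)
        for scheme in self.schemes:
            found.extend(scheme.all_descriptors())
        return found


def instantiate(scheme, values):
    """Build a description and report whether it is fully instantiated."""
    description = {}
    for descriptor in scheme.all_descriptors():
        if descriptor.name in values:
            value = values[descriptor.name]
            if not isinstance(value, descriptor.value_type):
                raise TypeError(f"{descriptor.name} expects "
                                f"{descriptor.value_type.__name__}")
            description[descriptor.name] = value
    fully = len(description) == len(scheme.all_descriptors())
    return description, fully


# Hypothetical descriptors for the features "image color" and "movie title".
color = Descriptor("color_histogram", "image color", list)
title = Descriptor("title_text", "movie title", str)
visual = DescriptionScheme("visual", descriptors=[color])
movie = DescriptionScheme("movie", descriptors=[title], schemes=[visual])

partial, is_full = instantiate(movie, {"title_text": "Example"})
```

Here `is_full` is `False` because no value was supplied for `color_histogram`, which corresponds to a partially instantiated description scheme.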

The "description definition language (DDL)" is the language that allows the 
creation of new description schemes and, possibly, new descriptors. The description 
definition language also allows the extension and modification of existing description 
schemes. 

MPEG-7 data may be physically located with the associated audiovisual material, 
in the same data stream or on the same storage system, but the descriptions can also be 
stored elsewhere. When the content and its descriptions are not co-located, mechanisms 
that link audiovisual material and their MPEG-7 descriptions are needed; these links must 
work in both directions. 

The exemplary embodiment meets the present MPEG-7 requirements outlined in 
the present draft of MPEG-7 standard requirements. Requirements include criteria 
relating to descriptors, description scheme requirements, the description definition 
language requirements and system requirements. While the exemplary embodiment of 
the invention should satisfy all requirements of MPEG-7 when taken as a whole, not all 
requirements have to be satisfied by each individual descriptor or description scheme. 

The descriptor requirements include cross-modality, direct data manipulation, data 
adaptation, language of text-based descriptions, linking, prioritization of related 
information and unique identification. Description scheme requirements include 
description scheme relationships, descriptor prioritization, descriptor hierarchy, descriptor 
scalability, temporal range description, data adaptation, compositional capabilities, 
unique identification, primitive data types, composite data types, multiple media types, 
various types of description scheme instantiations, relationships within a description 
scheme and between description schemes, relationship between description and data, 
links to ontologies, platform independence, grammar, constraint validation, intellectual 
property management and protection, human readability and real time support. 




While a description scheme can be generated using any description definition 
language, the exemplary embodiment of the invention uses Extensible Markup Language 
(XML) to represent the integration description scheme. XML is a useful subset of 
SGML. XML is easier to learn, use and implement than SGML. XML allows for 
self-description, i.e., allows description and structure of description in the same format 
and document. Use of XML also allows linking of collections of data by importing 
external document type definitions using description schemes. 

Additionally, XML is highly modular and extensible. XML provides a 
self-describing and extensible mechanism, and although not media centric, can provide a 
reasonable starting basis. Another major advantage of using XML is that it allows the 
descriptions to be self-describing, in the sense that they combine the description and the 
structure of the description in the same format and document. XML also provides the 
capability to import external document type definitions (or DTDs), e.g., for feature 
descriptors, into the image description scheme document type definitions in a highly 
modular and extensible way. 

According to the exemplary embodiment of the invention, each multimedia 
component description can include multimedia component objects. Each multimedia 
component object has one or more associated multimedia component features. The 
multimedia component features of an object are grouped together as being visual, audio 
or a relationship based on semantics or media. In the multimedia component description 
scheme, each feature of an object has one or more associated descriptors. 

The multimedia description scheme also includes specific document type 
definitions, also generated using the XML framework, to provide example descriptors. 
The document type definition provides a list of the elements, tags, attributes, and entities 
contained in the document, and their relationships to each other. Document type 
definitions specify a set of rules for the structure of a document. For example, a 
document type definition specifies the parameters needed for certain kinds of documents. 
Using the multimedia description scheme, document type definitions may be included in 
the file that contains the document they describe. In such a case, the document type 
definition is included in a document's prolog after the XML declaration and before the 
actual document data begins. Alternatively, the document type definition may be linked 
to the file from an external URL. Such external document type definitions can be shared 
by different documents and Web sites. In such a way, for a given descriptor, the 
multimedia description scheme can provide a link to external descriptor extraction code 
and descriptor similarity code. 
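
The placement of an internal document type definition, in the prolog after the XML declaration and before the document data, can be illustrated with a short Python sketch. The element names `description` and `color_histogram` are hypothetical and are not taken from Appendix A. Note also that Python's standard `xml.etree.ElementTree` parser reads past the DTD but does not validate against it.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: XML declaration, then the internal DTD in the prolog,
# then the actual document data.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE description [
  <!ELEMENT description (color_histogram)>
  <!ELEMENT color_histogram (#PCDATA)>
]>
<description>
  <color_histogram>0.25 0.5 0.25</color_histogram>
</description>
"""

root = ET.fromstring(DOCUMENT)  # parses the data; no DTD validation is done
bins = [float(x) for x in root.find("color_histogram").text.split()]
```

An external definition would instead be referenced from the prolog with a declaration of the form `<!DOCTYPE description SYSTEM "...">`, where the system identifier is the URL of the shared DTD.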

Audiovisual material that has MPEG-7 data associated with it may include still pictures, 
graphics, 3D models, audio, speech, video, and information about how these elements are 
combined in a multimedia presentation, i.e., scenario composition information. A special case 
of this general data type may include facial expressions and personal characteristics. 

Fig. 1A is a block diagram of an exemplary multimedia integration description 
system 100. The multimedia integration description system 100 includes global media 
description unit 110, local media description unit 120, integration descriptors unit 150, 
multimedia integration description scheme unit 160, and multimedia integration 
description generator 165. 

The global media description unit 110 includes a global description generation 
unit 115 which receives multimedia content and provides global descriptions to the 
integration descriptors unit 150. The global descriptions provided to the integration 
descriptors unit 150 are descriptions that are relevant to the multimedia content as a 
whole, such as time, duration, space, etc. The local media description unit 120 includes 
description generation units for various categories of multimedia content including image 
description generation unit 125, video description generation unit 130, audio description 
generation unit 135, synthetic audiovisual description generation unit 140 and text 
description generation unit 145. 

While Fig. 1A illustrates the relationship between the integration description 
scheme and five categories of single media descriptions, one skilled in the art will 
appreciate that these categories are exemplary and may therefore be subdivided or 
reclassified into a greater or lesser number of categories. In that regard, the exemplary 
embodiment in Fig. 1A illustrates how multimedia content can be divided and 
categorized into five description categories. 
The multimedia integration description scheme unit 160 contains description 
schemes for integrating one or more of the description categories. In other words, the 
multimedia integration description scheme unit 160 maps how image, video, audio, 
synthetic and/or text descriptions, as descriptions of the component objects of a 
multimedia object, should be combined to form the description of the composite 
multimedia object and to be stored for easy retrieval. The multimedia integration 
description scheme unit 160 also provides input to the integration descriptors unit 150 so 
that it can provide proper descriptor values to the multimedia integration description 
generator 165. 

The multimedia integration description generator 165 generates a multimedia 
integration description based on one or more of the image, video, audio, synthetic 
and/or text descriptions received from the description generation units 125-145, the 
descriptor values received from the integration descriptors unit 150 and the multimedia 
integration description scheme received from the multimedia integration description 
scheme unit 160. The multimedia integration description generator 165 then stores the 
multimedia integration description in the database 170. 

Once the multimedia integration description has been stored, a user terminal 180, 
for example, may request multimedia content from a search engine 175. The search 
engine 175 then retrieves from the database 170 the multimedia content descriptions 
whose multimedia integration descriptions meet what the user requested, and provides 
the retrieved multimedia content descriptions to the user at terminal 180. 
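
The retrieval step performed by the search engine 175 can be sketched as a simple match of a query against stored descriptions. The sketch below is an assumption-laden illustration: the dictionary layout of the stored descriptions, the `matches` and `search` helpers and the sample entries are all hypothetical, and a real system would use richer descriptors and similarity matching.

```python
def matches(description, query):
    """True if every (category, term) pair of the query occurs in the description."""
    return all(term in description.get(category, ())
               for category, term in query.items())


def search(database, query):
    """Return the identifiers of all stored descriptions that satisfy the query."""
    return sorted(stream_id for stream_id, description in database.items()
                  if matches(description, query))


# Hypothetical stored integration descriptions, keyed by stream identifier.
database = {
    "news1": {"video": ["anchor"], "audio": ["speech"], "text": ["transcript"]},
    "clip2": {"image": ["logo"], "audio": ["music"]},
    "talk3": {"video": ["speaker"], "audio": ["speech"]},
}

speech_hits = search(database, {"audio": "speech"})
captioned_hits = search(database, {"audio": "speech", "text": "transcript"})
```

Because the query spans several media categories at once, it can only be answered from an integrated description, which is the point of storing one record per multimedia stream.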

Fig. 1B is an exemplary block diagram of one of the individual media type 
descriptor generation units (i.e., the image, video, audio, synthetic and text description 
generation units 125-145) shown in Fig. 1A. The individual media type description 
generation unit 121 includes a feature extractor and descriptors representation unit 122, 
an individual media type content description generator 123, and individual media type 
description scheme unit 124. 

The feature extractor and descriptors representation unit 122 receives individual 
media type content and extracts features from the content. The extracted features are 
represented by descriptor values which are output to the individual media type content 
description generator 123. The individual media type content description generator 123 
uses the individual media type description scheme provided by the individual media type 
description scheme unit 124 and the descriptor values provided by feature extractor and 
descriptors representation unit 122 to output the content description which is sent to the 
multimedia integration description generator 165, shown in Fig. 1A. 
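
As a hedged illustration of the two stages just described, the Python sketch below extracts one feature (a color histogram, one of the example descriptors mentioned earlier) and then arranges the resulting descriptor values according to a description scheme. The function names, the four-bin histogram and the tuple used as a scheme are assumptions made for this sketch only.

```python
def color_histogram(pixels, bins=4):
    """Normalized histogram of 8-bit intensity values (a toy descriptor value)."""
    counts = [0] * bins
    width = 256 // bins
    for value in pixels:
        counts[min(value // width, bins - 1)] += 1
    total = len(pixels)
    return [count / total for count in counts] if total else counts


def generate_description(scheme, descriptor_values):
    """Arrange descriptor values in the order fixed by the description scheme."""
    missing = [name for name in scheme if name not in descriptor_values]
    if missing:
        raise KeyError(f"missing descriptor values: {missing}")
    return {name: descriptor_values[name] for name in scheme}


# Hypothetical image description scheme: an ordered list of descriptor names.
image_scheme = ("color_histogram", "pixel_count")

pixels = [0, 10, 70, 130, 200, 255, 255, 64]
values = {
    "color_histogram": color_histogram(pixels),
    "pixel_count": len(pixels),
}
description = generate_description(image_scheme, values)
```

The first function plays the role of the feature extractor and descriptors representation unit, and the second plays the role of the content description generator; the scheme is passed in separately, as in Fig. 1B.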

Fig. 2 shows the multimedia integration description scheme unit 160. The multimedia 
integration description scheme unit 160 interacts with a global media description 
scheme 115, an image description scheme 220, a video description scheme 230, an audio 
description scheme 240, a synthetic audiovisual description scheme 250 and a text 
description scheme 260, which provide description scheme inputs to the multimedia 
integration description scheme 210. The multimedia integration description scheme 210 
also receives input from integration descriptors 205. 

In this manner, the integration descriptors 205 and the description 
schemes 215-260 provide individual maps for each category of multimedia content. The 
description schemes 220-260 provide description schemes for individual multimedia 
categories which are integrated into a composite multimedia description scheme 210. 
The multimedia integration description scheme 210 is used by the multimedia integration 
description generator 165 to generate a multimedia integration description for the 
multimedia content which is then stored in the database 170 for future retrieval by search 
engine 175. 

The multimedia integration description scheme (or MMDS) 210 describes 
multimedia content which may contain composing data from different media types, such 
as images, natural video, audio, synthetic video, and text. The multimedia integration 
description scheme 210 is configured to meet the requirements for multimedia integration 
description schemes specified by MPEG-7, for example, and is independent of any 
description definition language. The multimedia integration description scheme 210 is 
also configured to achieve the maximum synergy with the separate description schemes, 
such as the image, video, audio, synthetic, and text description schemes 220-260. 
In the multimedia integration description scheme 210, a multimedia stream is 
represented as a set of relevant multimedia objects that can be further organized by using 
object hierarchies. Relationships among multiple multimedia objects that cannot be 
expressed using a tree structure are described using entity relation graphs. Multimedia 
objects can include multiple features, each of which can contain multiple descriptors. 
Each descriptor can link to external feature extraction and similarity matching code. 
Features are grouped according to the following categories: media features, semantic 
features, and temporal features. 

At the same time, each multimedia object includes a set of single-media objects, 
which together form the multimedia object. Single-media objects are associated with 
features, hierarchies, entity relation graphs, and multiple abstraction levels, as described 
by single media description schemes (image, video, etc.). Multimedia objects are an 
association of multiple single-media objects, for example, a video object corresponding to 
a person, an audio object corresponding to his speech and the text object corresponding to 
the transcript. 
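
The composition of a multimedia object from single-media objects can be sketched with two small Python classes. This is a minimal sketch under stated assumptions: the class names, field names and identifiers are hypothetical, and the feature dictionary stands in for the media, semantic and temporal feature groups.

```python
from dataclasses import dataclass, field


@dataclass
class SingleMediaObject:
    """One object from a single media stream (names here are illustrative)."""
    object_id: str
    media_type: str  # e.g. "image", "video", "audio", "synthetic" or "text"
    features: dict = field(default_factory=dict)


@dataclass
class MultimediaObject:
    """An association of single-media objects that form one multimedia object."""
    object_id: str
    components: list = field(default_factory=list)

    def media_types(self):
        """Sorted list of the media types that make up this object."""
        return sorted({component.media_type for component in self.components})


# The example from the text: a person, his speech and the transcript.
person = MultimediaObject("person1", [
    SingleMediaObject("v1", "video", {"semantic": "person on screen"}),
    SingleMediaObject("a1", "audio", {"semantic": "speech"}),
    SingleMediaObject("t1", "text", {"semantic": "transcript"}),
])
```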

The multimedia integration description scheme 210 includes the flexibility of the 
object-oriented framework which is also found in the individual media description 
schemes. The flexibility is achieved by (1) allowing parts of the description scheme to be 
instantiated; (2) using efficient categorization of features and clustering of objects (using 
the indexing hierarchy, for example); and (3) supporting efficient linking, embedding, or 
downloading of external feature descriptor and execution codes. 

Elements defined in the multimedia integration description scheme 210 can be 
used to derive new elements for different domains. As mentioned earlier, it has been 
used in a description scheme for a specific domain (e.g., home media). 

One unique aspect of the multimedia integration description scheme 210 is the 
capability to define multiple abstraction levels based on any arbitrary set of criteria. The 
criteria can be specified in terms of visual features (e.g., size), semantic relevance (e.g., 
relevance to user interest profile), or service quality (e.g., media features). 
The multimedia integration description scheme 210 aims at describing multimedia 
content resulting from integration of multiple media streams. Examples of individual 
media streams are images, audio sequences, natural video sequences, synthetic video 
sequences, and text data. An example of such integrated multimedia stream is a 
television program that includes video (both natural and synthetic), audio, and text 
streams. 

Under the multimedia integration description scheme, a multimedia stream is 
represented as a set of multimedia objects that include objects from the composing media 
streams. Multimedia objects are organized in object hierarchies or in entity relation 
graphs. Relationships among two or more multimedia objects that cannot be expressed in 
a tree structure can be described using multimedia entity relation graphs. The tree 
structures can be efficiently indexed and traversed, while the entity relation graphs can 
model general relationships. 
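
The difference between the two organizing structures can be illustrated as follows. The sketch assumes a nested dictionary for an object hierarchy and a list of (source, relation, target) triples for an entity relation graph; the object names and relation labels are invented for this illustration.

```python
# Hypothetical object hierarchy: each key is an object, each value its children.
hierarchy = {
    "program": {
        "segment1": {"person1": {}, "caption1": {}},
        "segment2": {"person2": {}},
    }
}


def walk(tree, depth=0):
    """Depth-first traversal of the hierarchy, yielding (depth, object name)."""
    for name, children in tree.items():
        yield depth, name
        yield from walk(children, depth + 1)


# Hypothetical entity relation graph: relationships that do not fit the tree.
relations = [
    ("person1", "speaks_before", "person2"),
    ("caption1", "describes", "person1"),
]


def related(relations, obj):
    """All objects linked to obj by any relation, in either direction."""
    outgoing = {target for source, _, target in relations if source == obj}
    incoming = {source for source, _, target in relations if target == obj}
    return sorted(outgoing | incoming)


order = list(walk(hierarchy))
neighbours = related(relations, "person1")
```

The hierarchy supports a simple recursive traversal, while the relation `person1` to `person2` crosses two branches of the tree and therefore has to live in the graph.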

The multimedia integration description scheme 210 builds on top of the individual 
media description schemes, including the image, video, audio, synthetic and text 
description schemes. All elements and structures used in the multimedia integration 
description scheme 210 are intuitive extensions of those used in individual media 
description schemes. 

Fig. 3 is an example showing the basic elements and structures of the multimedia
integration description scheme 210. The explanation will include example XML with the 
specific document type definition declarations included in Appendix A. A more 
complete listing of the XML description of the multimedia stream in Fig. 3 is included in 
Appendix B. 

Figs. 4 and 5 show the graphical representation of the proposed multimedia
description scheme following UML notation. Figs. 4 and 5 also show the
relationships of the multimedia description scheme 210 with the description schemes
220-260 for individual media. It should be emphasized that the same structure is used at the
multimedia object level and the single-media object level for the description schemes of
the individual media: image, video, etc.

The multimedia stream element (<MM_Stream>) refers to the multimedia content 
being described. The multimedia stream is represented as one set of multimedia objects
(<MM_Object_Set>), zero or more object hierarchies (<Object_Hierarchy>), and zero or
more entity relation graphs (<Entity_Relation_Graph>). Each one of these elements is 
described in detail below. 

The multimedia stream element can include a unique identifier attribute ID. 
Descriptions of archives containing multimedia streams will use these IDs to reference
multimedia streams. 

An example of the use of a multimedia stream element, expressed in XML, follows:

<!-- A multimedia stream -->
<MM_Stream id="mmstream1">

    <!-- One multimedia object set -->
    <MM_Object_Set> </MM_Object_Set>

    <!-- Multiple object hierarchies -->
    <Object_Hierarchy> </Object_Hierarchy>

    <!-- Multiple entity relation graphs -->
    <Entity_Relation_Graph> </Entity_Relation_Graph>

</MM_Stream>
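
The cardinalities stated above (exactly one object set, zero or more hierarchies, zero or more entity relation graphs) can be illustrated with a minimal Python sketch. The helper names `build_mm_stream` and `check_mm_stream` are hypothetical and are not part of the described scheme or its appendices; only the element names come from the text.

```python
import xml.etree.ElementTree as ET


def build_mm_stream(stream_id, n_hierarchies=0, n_graphs=0):
    """Build an <MM_Stream> skeleton (hypothetical helper): one object set,
    then zero or more object hierarchies and entity relation graphs."""
    stream = ET.Element("MM_Stream", {"id": stream_id})
    ET.SubElement(stream, "MM_Object_Set")
    for _ in range(n_hierarchies):
        ET.SubElement(stream, "Object_Hierarchy")
    for _ in range(n_graphs):
        ET.SubElement(stream, "Entity_Relation_Graph")
    return stream


def check_mm_stream(stream):
    """Return True if the children match the cardinalities given in the text."""
    allowed = {"MM_Object_Set", "Object_Hierarchy", "Entity_Relation_Graph"}
    tags = [child.tag for child in stream]
    return (stream.tag == "MM_Stream"
            and tags.count("MM_Object_Set") == 1
            and all(tag in allowed for tag in tags))


stream = build_mm_stream("mmstream1", n_hierarchies=2, n_graphs=1)
print(ET.tostring(stream, encoding="unicode"))
```

The check is deliberately structural only; the document type definitions in Appendix A remain the authoritative statement of what a valid description contains.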

The basic description element of the multimedia description scheme is the 
multimedia object element (<MM_Object>). The set of all the multimedia objects in a 
multimedia stream is included within the multimedia object set (<MM_Object_Set>).

A multimedia object element includes a collection of single-media objects from 
one or more media streams that together form a relevant entity for searching, filtering, or 
presentation. These single-media objects may be from the same media stream or different 
media streams. Usually, single-media objects are from different media (e.g., audio, 
video, and text). These elements are defined in the description scheme of the
corresponding media and may have associated hierarchies or entity relation graphs. For
purposes of discussion, the definition of a multimedia object also allows single-media
objects of the same media type to be used. The single-media objects in a multimedia
object do not need to be synchronized in time. In the following, "single-media object"
and "media object" are used interchangeably.

The composing media objects (<Image_Object>, <Video_Object>, etc.) inside a
multimedia object are included in a media object set element (<Media_Object_Set>). A
multimedia object element can also include zero or more object hierarchy elements
(<Object_Hierarchy>) and entity relation graph elements (<Entity_Relation_Graph>) to
describe spatial, temporal, and/or semantic relationships among the composing media
objects.

Each multimedia object can have multiple associated features and corresponding
feature descriptors, as discussed above. A multimedia object can include semantic
information (e.g., annotations), temporal information (e.g., duration), and media-specific
information (e.g., compression format).

Two types of multimedia objects are used: local and global objects. A global
multimedia object element represents the entire multimedia stream. On the other hand, a
local multimedia object has a limited scope within the multimedia stream, for example,
an arbitrarily shaped video object representing a person and a segmented audio object
corresponding to his speech. To differentiate between local and global multimedia objects,
the multimedia object element includes a required attribute, type, whose value can be
LOCAL or GLOBAL. Only one multimedia object with a GLOBAL type will be
included in a multimedia stream description.

Each multimedia object element can also have four optional attributes: ID,
Object_Ref, Object_Node_Ref, and Entity_Node_Ref. ID (or id) is a unique identifier of
the multimedia object within the multimedia stream description. When the multimedia
object acts as a placeholder (i.e., no feature descriptors included) for another multimedia
object, Object_Ref references that multimedia object. Object_Node_Ref and
Entity_Node_Ref include the lists of the identifiers of object node elements
(<Object_Node>) and entity node elements (<Entity_Node>) that reference the
multimedia object element, respectively.

An example showing how these elements are used in XML is included below.
Fig. 3 includes examples of global multimedia objects (mmog0), local multimedia objects
(mmol1, mmol2, etc.), and media objects (aolg0, vog0, o0, etc.). See Appendix B for the
XML of the example in Fig. 3.

<!-- A multimedia object set element -->
<MM_Object_Set>

    <!-- One or more multimedia objects -->
    <MM_Object type="GLOBAL" id="MMobjt1" Object_Node_Ref="ON1"
               Entity_Node_Ref="EN1" ...>
        <!-- A media object set -->
        <Media_Object_Set> </Media_Object_Set>
        <!-- Multiple object hierarchies -->
        <Object_Hierarchy> </Object_Hierarchy>
        <!-- Multiple entity relation graphs -->
        <Entity_Relation_Graph> </Entity_Relation_Graph>
        <!-- Zero or one multimedia object media feature element -->
        <MM_Obj_Media_Features> </MM_Obj_Media_Features>
        <!-- Zero or one multimedia object semantic feature element -->
        <MM_Obj_Semantic_Features> </MM_Obj_Semantic_Features>
        <!-- Zero or one multimedia object temporal feature element -->
        <MM_Obj_Temporal_Features> </MM_Obj_Temporal_Features>
    </MM_Object>

    <MM_Object type="LOCAL" id="Mmobj2" Object_Node_Ref="ON2 ON10"
               Entity_Node_Ref="EN2 EN4 EN7" ...>
        <!-- Content of MM Object -->
    </MM_Object>

</MM_Object_Set>
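
The attribute rules for multimedia objects (a required type of LOCAL or GLOBAL, a single GLOBAL object, and reference lists held in Object_Node_Ref and Entity_Node_Ref) can be checked with a short Python sketch. The functions `validate_object_set` and `node_refs` are hypothetical helpers, the sample identifiers are illustrative, and the sketch assumes that reference lists are whitespace-separated, as the "ON2 ON10" value in the example suggests.

```python
import xml.etree.ElementTree as ET

# Illustrative sample; identifiers are assumptions, not taken from Appendix B.
SAMPLE = """
<MM_Object_Set>
  <MM_Object type="GLOBAL" id="MMobj1" Object_Node_Ref="ON1" Entity_Node_Ref="EN1"/>
  <MM_Object type="LOCAL" id="MMobj2" Object_Node_Ref="ON2 ON10"
             Entity_Node_Ref="EN2 EN4 EN7"/>
</MM_Object_Set>
"""


def validate_object_set(object_set):
    """Return a list of rule violations found in an <MM_Object_Set> element."""
    errors = []
    global_count = 0
    for obj in object_set.findall("MM_Object"):
        kind = obj.get("type")
        if kind not in ("LOCAL", "GLOBAL"):
            errors.append(f"{obj.get('id')}: type must be LOCAL or GLOBAL, got {kind!r}")
        elif kind == "GLOBAL":
            global_count += 1
    if global_count > 1:
        errors.append(f"expected at most one GLOBAL object, found {global_count}")
    return errors


def node_refs(obj, attribute):
    """Split a reference-list attribute into individual node identifiers."""
    return obj.get(attribute, "").split()


object_set = ET.fromstring(SAMPLE)
print(validate_object_set(object_set))
print(node_refs(object_set[1], "Entity_Node_Ref"))
```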



The set of all the single-media object elements (<Video_Object>,
<Image_Object>, etc.) composing a multimedia object are included in the media object
set element (<Media_Object_Set>). Each media object refers to one type of media, such
as image, video, audio, synthetic video, and text. These media objects are defined in the
description scheme of the corresponding type of media. It is important to point out that
the single-media objects and the multimedia objects share very similar structures at
multiple levels although the features are different.

In the same fashion as the multimedia objects, relationships among single-media
objects can be described using media object hierarchies or entity relation graphs.
Although entity relation graphs may lack the retrieval and traversal efficiency of
hierarchical structures, they are used when efficient hierarchical tree structures are not
adequate to describe specific relationships. Note that media object hierarchies may
include media objects of different media types.

In the multimedia integration description scheme 210, each multimedia object can
contain three multimedia object feature elements that group features based on the
information they convey: media (<MM_Obj_Media_Features>), semantic
(<MM_Obj_Semantic_Features>), and temporal (<MM_Obj_Temporal_Features>)
information. Note that features associated with spatial positions of individual media
objects (e.g., positions of video objects) are included in each specific single-media object.
Spatial relationships among single-media objects inside a multimedia object are described
using the entity relation graphs. Table 1 includes examples of features for each feature
type:

Table 1: Examples of feature classes and features.

Feature Class    Features
Media            Data Location, Scalable Representation, Modality Transcoding¹
Semantic         Text Annotation¹, Who¹, What Object¹, What Action¹, Why¹, When¹,
                 Where¹, Keywords
Temporal         Duration

¹ Defined in the image description scheme.



Each multimedia object feature element includes corresponding descriptor
elements. Specific descriptors can include links to external extraction and similarity
matching code. External document type definitions for descriptors can be imported and
used in the current multimedia description scheme. In this framework, new features,
types of features, and descriptors can be included in an extensible and modular way.

In Appendix A, the declarations of the following example features are included: 
Data_Location, Scalable_Representation, Text_Annotation, Keywords, and Duration. 

Each object hierarchy element (<Object_Hierarchy>) includes one object node
element (<Object_Node>). An object hierarchy element can include a unique identifier
as an attribute, ID, for referencing purposes. It can also include an attribute, type, to
describe the type of binding (e.g., semantic) expressed by the hierarchy.

In turn, an object node element (<Object_Node>) includes zero or more
object node elements, forming a tree structure. Each multimedia object node references a
multimedia object in the multimedia object set through an attribute, Object_Ref, by using
the latter's unique identifier. Each object node element can also include a unique
identifier in the form of an attribute, ID. By including the object nodes' unique identifiers
in their Object_Node_Ref attributes, multimedia objects can point back to the object nodes
referencing them. This mechanism is provided for efficient traversal of the multimedia
description, from multimedia objects in the multimedia object set to
corresponding object nodes in the object hierarchy and vice versa.

The hierarchy is a way to organize the multimedia objects in the multimedia
object set. The multimedia objects in a multimedia stream can be organized based on
different criteria: the temporal relationships, the semantic relationships, and the value of
one or more features. The top object of each multimedia object hierarchy can specify the
criteria followed to generate the hierarchy. An example multimedia object hierarchy is
detailed in XML below:

<!-- Object hierarchy element -->
<Object_Hierarchy id="...">

    <!-- One object node -->
    <Object_Node id="..." Object_Ref="...">

        <!-- Multiple object nodes -->
        <Object_Node id="..." Object_Ref="...">
        </Object_Node>

    </Object_Node>
</Object_Hierarchy>
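
The two-way referencing between multimedia objects and object nodes can be sketched in Python as follows. The identifiers (obj0, ON0, OH1, etc.) and the helper functions are hypothetical; the sketch only assumes the Object_Ref and Object_Node_Ref attributes described in the text, with whitespace-separated reference lists.

```python
import xml.etree.ElementTree as ET

# Illustrative description; all identifiers are assumptions.
DESCRIPTION = """
<MM_Stream id="mmstream1">
  <MM_Object_Set>
    <MM_Object type="GLOBAL" id="obj0" Object_Node_Ref="ON0"/>
    <MM_Object type="LOCAL" id="obj1" Object_Node_Ref="ON1"/>
    <MM_Object type="LOCAL" id="obj2" Object_Node_Ref="ON2"/>
  </MM_Object_Set>
  <Object_Hierarchy id="OH1" type="TEMPORAL">
    <Object_Node id="ON0" Object_Ref="obj0">
      <Object_Node id="ON1" Object_Ref="obj1"/>
      <Object_Node id="ON2" Object_Ref="obj2"/>
    </Object_Node>
  </Object_Hierarchy>
</MM_Stream>
"""

stream = ET.fromstring(DESCRIPTION)
# Index both sides by identifier so either direction is a dictionary lookup.
objects = {o.get("id"): o for o in stream.iter("MM_Object")}
nodes = {n.get("id"): n for n in stream.iter("Object_Node")}


def object_for_node(node_id):
    """Follow Object_Ref from an object node to its multimedia object."""
    return objects[nodes[node_id].get("Object_Ref")]


def nodes_for_object(object_id):
    """Follow Object_Node_Ref from a multimedia object back to its nodes."""
    refs = objects[object_id].get("Object_Node_Ref", "").split()
    return [nodes[ref] for ref in refs]


def child_objects(object_id):
    """IDs of the objects directly below the given object in the hierarchy."""
    return [child.get("Object_Ref")
            for node in nodes_for_object(object_id)
            for child in node.findall("Object_Node")]


print(child_objects("obj0"))
```

Without the back-references, finding the nodes for an object would require scanning every hierarchy; with them, both directions reduce to lookups by identifier.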

Although a hierarchy is adequate for many purposes (e.g., A is the father of B)
and is efficient in retrieval, some relationships among multimedia objects cannot be
expressed using a hierarchical structure (e.g., A is talking to B). For purposes of this
discussion, the multimedia description scheme 210 also allows the specification of more
complex relations among multimedia objects using an entity relation graph.

An entity relation graph element (<Entity_Relation_Graph>) includes one or more
entity relation elements (<Entity_Relation>). It has two optional attributes: a unique
identifier, ID, and a string, type, that describes the binding expressed by the graph.

An entity relation element (<Entity_Relation>) must include one relation element
(<Relation>), zero or more entity node elements (<Entity_Node>), zero or more entity
node set elements (<Entity_Node_Set>), and zero or more entity relation elements
(<Entity_Relation>). An optional attribute, type, can be included in an entity relation
element to describe the type of relation it expresses.

Each entity node element references a multimedia object in the multimedia object
set through an attribute, Object_Ref, by using the latter's unique identifier. Each entity
node element can also include a unique identifier in the form of an attribute, ID. By
including the entity nodes' unique identifiers in their Entity_Node_Ref attributes,
multimedia objects can point back to the entity nodes referencing them. This mechanism
is provided for efficient traversal of the multimedia description, from
multimedia objects in the multimedia object set to corresponding entity nodes in the ER
graph and vice versa.

Hierarchies and entity relation graphs can be used to state spatial, temporal, and 
semantic relationships among media objects. Examples of such types of relationships are 
described below. In addition, the use of entity relation graphs is also described in more 
detail. 

<!-- In this example, SMIL temporal models are used. -->
<Entity_Relation_Graph type="TEMPORAL.SMIL">
    <Entity_Relation type="TEMPORAL">
        <Relation With_Respect_To="MediaObject1">
            <Temporal_Sequential pattern="DELAY" />
        </Relation>
        <Entity_Node Media_Object_Ref="MediaObject1" />
        <Entity_Node Media_Object_Ref="MediaObject2" Start_Time="2" />
        <Entity_Relation>
            <Relation type="TEMPORAL">
                <Temporal_Parallel />
            </Relation>
            <Entity_Node Media_Object_Ref="MediaObject3" />
            <!-- Optional start or end time -->
            <Entity_Node Media_Object_Ref="MediaObject4" />
        </Entity_Relation>
    </Entity_Relation>
</Entity_Relation_Graph>

<Entity_Relation_Graph type="SPATIAL.2DSTRING">
    <Entity_Relation type="SPATIAL">
        <Relation With_Respect_To="MediaObject1">
            <Spatial_Arrangement At_Time="3">
                <Spatial_Relevance pattern="Upper_Right_Of" />
            </Spatial_Arrangement>
        </Relation>
        <Entity_Node Media_Object_Ref="MediaObject1" />
        <Entity_Node Media_Object_Ref="MediaObject2" />
        <Entity_Node Media_Object_Ref="MediaObject3" />
    </Entity_Relation>
    <Entity_Relation type="SPATIAL">
        <Relation With_Respect_To="MediaObject2">
            <Spatial_Arrangement At_Time="3">
                <Spatial_Relevance pattern="Lower_Left_Of" />
            </Spatial_Arrangement>
        </Relation>
        <Entity_Node Media_Object_Ref="MediaObject3" />
        <Entity_Node Media_Object_Ref="MediaObject1" />
        <Entity_Node Media_Object_Ref="MediaObject2" />
    </Entity_Relation>
</Entity_Relation_Graph>
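
As an illustration of how such a graph might be consumed, the following Python sketch reads the spatial entity relations out of a graph and lists, for each one, the pattern, the reference object, the remaining entity nodes, and the time. The function `spatial_relations` is a hypothetical helper, and the sketch makes no claim about how a pattern such as Upper_Right_Of is to be interpreted between the reference object and the other nodes.

```python
import xml.etree.ElementTree as ET

GRAPH = """
<Entity_Relation_Graph type="SPATIAL.2DSTRING">
  <Entity_Relation type="SPATIAL">
    <Relation With_Respect_To="MediaObject1">
      <Spatial_Arrangement At_Time="3">
        <Spatial_Relevance pattern="Upper_Right_Of"/>
      </Spatial_Arrangement>
    </Relation>
    <Entity_Node Media_Object_Ref="MediaObject1"/>
    <Entity_Node Media_Object_Ref="MediaObject2"/>
    <Entity_Node Media_Object_Ref="MediaObject3"/>
  </Entity_Relation>
</Entity_Relation_Graph>
"""


def spatial_relations(graph):
    """Return (pattern, reference object, other objects, time) tuples,
    one per entity relation that carries a spatial arrangement."""
    results = []
    for relation_el in graph.iter("Entity_Relation"):
        relation = relation_el.find("Relation")
        if relation is None:
            continue
        arrangement = relation.find("Spatial_Arrangement")
        if arrangement is None:
            continue
        relevance = arrangement.find("Spatial_Relevance")
        if relevance is None:
            continue
        reference = relation.get("With_Respect_To")
        members = [node.get("Media_Object_Ref")
                   for node in relation_el.findall("Entity_Node")]
        others = [m for m in members if m != reference]
        results.append((relevance.get("pattern"), reference, others,
                        arrangement.get("At_Time")))
    return results


for row in spatial_relations(ET.fromstring(GRAPH)):
    print(row)
```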

The content of the relation could be particularized for each different scenario. In
Appendix A, temporal, spatial, and semantic relations are included. Similar types of
relations could be added as needed. Acceptable relationships for specific applications can
be defined in advance. The content of this element states the relation among the entity
nodes included in that entity relation element.

Many different types of relations can be declared among multiple objects in the
object set; spatial (topological or directional), temporal (topological or directional), and
semantic are just some examples. An example of an entity relation graph was included
above to show how these structures could be used to describe temporal and spatial
relationships among media objects. The same example is also valid for multimedia
objects (Appendix B).

Fig. 6 is an exemplary flowchart of the multimedia integration description
process. The process begins in step 610, where multimedia content is received by the global
description generator 115 and one or more of the image description generator 125, video
description generator 130, audio description generator 135, synthetic description
generator 140 and the text description generator 145. In step 620, the multimedia
components are separated, and in step 630 the single media event is classified within each
of the multimedia categories in the description generators 125-145.

In step 640, the description generators 125-145 generate descriptions from each
respective multimedia category, which are then forwarded to the multimedia integration
description generator 165. In step 650, the multimedia integration description
generator 165 puts the descriptions into the proper format using the multimedia
integration description scheme provided by the multimedia integration description
scheme unit 170.

Then, in step 660, the multimedia integration description generator 165 integrates
the multimedia descriptions and in step 670, the integrated descriptions are stored in
database 160. The process then ends.
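
The flow of steps 640 through 670 can be outlined in a small Python sketch. Everything here is an assumption made for illustration: the function names, the use of a dictionary for the already separated and classified components (steps 610-630), the choice to wrap the results in a single GLOBAL multimedia object, and the list standing in for database 160. It is not the code of the Appendix.

```python
import xml.etree.ElementTree as ET


def describe_and_integrate(components, generators, database):
    """components: media category -> separated, classified content (steps 610-630).
    generators: media category -> function returning an XML element (step 640).
    database: a list standing in for database 160 (step 670)."""
    # Steps 650-660: format and integrate under the multimedia stream element.
    stream = ET.Element("MM_Stream")
    object_set = ET.SubElement(stream, "MM_Object_Set")
    mm_object = ET.SubElement(object_set, "MM_Object", {"type": "GLOBAL"})
    media_set = ET.SubElement(mm_object, "Media_Object_Set")
    for category, payload in components.items():
        # Step 640: each per-media generator produces its own description.
        media_set.append(generators[category](payload))
    # Step 670: store the integrated description.
    database.append(ET.tostring(stream, encoding="unicode"))
    return stream


def video_generator(payload):
    """Hypothetical stand-in for the video description generator."""
    return ET.Element("Video_Object", {"id": payload})


def image_generator(payload):
    """Hypothetical stand-in for the image description generator."""
    return ET.Element("Image_Object", {"id": payload})


database = []
stream = describe_and_integrate(
    {"video": "vo1", "image": "io1"},
    {"video": video_generator, "image": image_generator},
    database,
)
print(database[0])
```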

While the invention has been described with reference to the embodiments, it is to
be understood that the invention is not restricted to the particular forms shown in the
foregoing embodiments. Various modifications and alterations can be made thereto
without departing from the scope of the invention.



